QSAR study of phenolic compounds and their anti-DPPH radical activity by discriminant analysis

Phenolic compounds (PCs) could be applied to reduce reactive oxygen species (ROS) levels, and are used to prevent and treat diseases related to oxidative stress. QSAR study was applied to elucidate the relationship between the molecular descriptors and physicochemical properties of polyphenol analogues and their DPPH radical scavenging capability, to guide the design and discovery of highly-potent antioxidant substances more efficiently. PubMed database was used to collect 99 PCs with antioxidant activity, whereas, 105 negative PCs were found in ChEMBL database; their molecular descriptors were generated with Python's Rdkit package. While the molecular descriptors significantly related to the antioxidant activity of PCs were filtered by t-test. The prediction QSAR model was then established by discriminant analysis, and the obtained model was verified by the back-substitution and Leave-One-Out cross-validation methods along with heat map. It was revealed that the anti-DPPH radical activity of PCs was correlated with the drug-likeness and molecular fingerprints, physicochemical, topological, constitutional and electronic property. The established QSAR model could explicitly predict the antioxidant activity of polyphenols, thus were applicable to evaluate the potential of candidates as antioxidants.

Oxidative stress (OS) is the imbalance of redox reactions, which leads to the production of reactive oxygen species (ROS) 1 . The accumulation of ROS in the body or cells causes cytotoxic reaction, thus inducing a variety of pathological injuries, such as cardiovascular diseases, diabetes, tumors and other chronic diseases [2][3][4][5][6] . Therefore, it is feasible to initiate disease treatment with antioxidants to reverse this imbalance by reducing ROS level 7 . In this context, various measures to increase the use of antioxidants have been employed, and source from nature has become an important source of research and development of green and safe antioxidants 5 .
In recent years, a number of nature antioxidant substances, including crude extracts, bioactive partitions, chemical entities, have been discovered from insects [8][9][10][11][12] . As a continuous study of medicinal insects, five new phenolic compounds (PCs) with DPPH radical scavenging activity were isolated from the medicinal insect Blaps rynchopetera (Fairmaire), a local Chinese medicine, and their antioxidant activities were equivalent to vitamin C 13 . However, difficulties in separation and enrichment of above-mentioned PCs impeded further development of these lead compounds 14 . The design and synthesis of analogues of PCs from B. rynchopetera, would be a feasible alternative to more efficiently and economically prepare antioxidants.
Quantitative structure-activity relationship (QSAR) study is a powerful in-silico method in terms of design and discovery of bioactive compounds 15 . The presented work herein aimed to establish predictive QSAR model of PCs and their antioxidant activity with their 2D-structures and physicochemical properties as potential predictors. Multiple linear regression and discriminant analysis are mature multivariate statistical methods with reported application in the establishment of QSAR models [16][17][18][19] . Multiple linear regression requires assumptions of homoscedasticity and independent errors, which would likely be violated by data derived from different  20,21 . Thus, compounds from different sources could be incorporated together to fit QSAR model. Therefore, discriminant analysis was adopted to establish predictive models of PCs' molecular characteristics and their antioxidant activity for further molecular design for the discovery and development of efficient antioxidants.

Data preparation and methods
The strategy of modeling and the selection of molecular descriptors were shown in Scheme 1: (1) Literature retrieval and data collection; (2) Filtering and generating molecular descriptors; (3) Model establishment and fitting evaluation.
Collection and preparation of compounds. DPPH (Compound 1) radical scavenging assay is a valid and widely-accepted method for evaluating molecular antioxidant activities. The antioxidant activity was calculated by the rate of DPPH scavenging as Formula (1), and the 50% inhibitory concentration (IC 50 ) or 50% effective concentration (EC 50 ) values were regarded as indicator for molecular antioxidant activity.
where A 0 and A are the absorbances in the absence and in the presence of antioxidant, respectively. The absorbance reads for substance could be varied experiment by experiment. Compounds with values of IC 50 /EC 50 greater than 300 μM are hardly worthy for further development, therefore they would not be treated as lead compounds. Thus, 300 μM was applied as the cut point to differentiate positive and negative samples. Phenolic compounds (PCs) were derivatives with hydroxy-containing substitutions on aromatic ring of phenol (Compound 2). The key words of DPPH, phenolic compounds, and IC 50 /EC 50 values were applied for literature search in the PubMed database. Meanwhile, phenol was used as the keyword to search for negative samples in ChEMBL database.
The overall quality control exclusive criteria for positive samples included the followings: (1) a clear positive control could not be found in the original literature; (2) the IC 50 /EC 50 values of the same positive control substances from different articles were obviously inconsistent. The exclusive criteria for negative samples were: (1) the IC 50 /EC 50 values were less than 300 μM in the literature; (2) compound's DPPH scavenging activity was described though without generalizable IC 50 /EC 50 values.
(1) DPPH radical scavenging activity (%) = The red area is the process of data mining, yellow is the process of filtering molecular descriptor, and blue is the process of model fitting evaluation. Determination of molecular descriptors. The 200 molecular descriptors of well-defined, easy-interpretable, and most-often used in QSAR were generated by applying Python's Rdkit package 22 . These molecular descriptors mainly fall into the following categories namely: drug-likeness and molecular fingerprints, physicochemical, topological, constitutional and electronic property 23 . The semi-constant (more than 80% data with the same value) descriptors were excluded to eliminate redundant information and increase power-of-test, consequently, a total of 122 descriptors were retained for later QSAR modeling.

QSAR model development.
The performance of the obtained models mainly depends on their modeling descriptors. The model developed with improper descriptors would lead to either over-fitting or predict weakly and would be useless 23 . The t-test was then used to initially filter molecular descriptors that were less relevant with antioxidant activity. The primary objective of discriminant analysis was to build QSAR models that could predict the antioxidant activity of unknown compounds more reliably and precisely. These two statistical analysis methods were performed in SPSS 25.0, and the significance level was set as 0.05.
Evaluation of model fitting. Two validation techniques along with heat map were applied to assess the accuracy and robustness of the established QSAR model. The following analyses were carried out with SPSS 25.0 and Microsoft® Excel® 2019.
(1) Back-substitution method By comparing the predicted classification of the discriminant function and the actual classification, the correct discriminant proportion of the classification function was calculated. (2) Jackknife (Leave-One-Out cross-validation) (i) Individual sample was sequentially treated as predicative subset, and established the discriminant function with the remaining N-1 samples as training subset; (ii) Judging the correctness of predictive classification of predicative subset; (iii) Repeat the above two steps N times; (iv) Generating the correct classification proportion.
The error discriminant proportion (Formula (2)) of 10% or 20% was most often used as the standard to evaluate the established model, that is, those error discriminant proportion in validation check less than the standard suggestion satisfactory established model 20 . Furthermore, Kappa consistency coefficients (Formula (3)) were employed to evaluate the stability of the model.
where, PN is the number of predicted negative samples, and AP is the number of actual positive samples.
where, P 0 is observed coincidence rate, and P e is chance coincidence rate.
(3) The performance of established discriminant models on the randomly selected samples consisted each of 10 positive and 10 negative samples respectively were visually presented by heat map.

Results
Structural molecular data. In total 140 PCs with unambiguous DPPH radical scavenging activity were collected from PubMed database, after sifting through chemicals and rearranging duplicated labels for compounds with the same structure, 99 PCs were eventually included. Meanwhile, keeping phenol as the search keyword, negative samples in ChEMBL database, with a total of 8721 compounds were presented. To make the sample size of negative subset comparable to positive one, after sifting through the first 1200 pieces of data through exclusion criteria, 105 PCs were finally selected as negative subset (Scheme 1). The chemical structures, IC 50 values and literature source of 99 positive samples, whereas the chemical structures of 105 negative PCs were available in the supplemental document.
Initially filtered arguments. The t-test was performed to opt 87 molecular descriptors significantly related to the antioxidant activity of PCs (P < 0.05) ( Table 1). Among them, 4 molecular descriptors belonged to drug-  Table 2 with the sequency of inclusion of model fitting.
Combined Tables 1 and 2, it could be found that in the final QSAR model, the molecular descriptor of drug-likeness and molecular fingerprints included qed (X 1 ), FpDensityMorgan2 (X 3 ), FpDensityMorgan3 (X 4 ) and FpDensityMorgan1 (X 5 ); those of physicochemical property contained Kappa2 (X 7 ), PEOE_VSA6 (X 8 ), SMR_VSA4 (X 9 ) and SlogP_VSA5 (X 10 ); those of topological property comprised of MinAbsEStateIndex (X 6 ) and VSA_EState9 (X 11 ); those of constitutional property consisted of NOCount (X 12 ), fr_C_O_noCOO (X 13 ), fr_allylic_oxid (X 14 ), fr_aryl_methyl (X 15 ) and fr_ester (X 16 ); and the electronic property was represented by MinAbsPartialCharge (X 2 ). The classification discriminant functions (DF 1 , DF 2 ) were therefore generated based on estimation of corresponding β values ( Table 2).  Table 1. Results of filtering molecular descriptors by t-test. All the P values were less than 0.05. a The types of molecular descriptors: "DF" denotes drug-likeness and molecular fingerprints "P" denotes physicochemical property, "T" denotes topological property, "C" denotes constitutional property, "E" denotes electronic property.     Fig. 1 illustrated the correlation between molecular descriptors and DPPH radical scavenging activity of phenolic compounds. It was revealed that the antioxidant activity of PCs against DPPH was related to all types of molecular descriptors including drug-likeness and molecular fingerprints, physicochemical, topological, constitutional and electronic property. Table 3, the classification results of back-substitution method were as following: 99 cases were positive samples, and 96 cases were determined as positive subset, with the correct discrimination proportion of 96.97%; As for the negative subset, all of the 105 chemical identities were determined negative, with the correct discrimination proportion of 100.00%. Furthermore, the Kappa consistency coefficient of 0.971 suggested high consistence between the predictive classification and actual classification. Jackknife cross-validation was employed to further evaluate the stability of the discriminant functions. The correct discriminant proportions were 95.96% and 99.05% for positive and negative samples, respectively ( Table 4). The Kappa consistency coefficient calculated based on jackknife results was 0.951.

Evaluation of model fitting. As shown in
The results of both validation methods supported the accuracy and robustness of the obtained QSAR model. The heat map further visually validated the established QSAR model. As shown in Fig. 2, the color of the first and third molecular descriptors of the positive samples basically tend to be green, while the color of the negative samples tend to be red. The color of the two kinds of samples was opposite to the former on the fourth and fifth kinds of descriptors. Whereas the second kind of descriptor could distinguish the two kinds of samples effectively.

Discussion
Positive samples were composed of PCs with definite DPPH radical scavenging activity, so it was easier to collect chemical structures with specific IC 50 /EC 50 values in PubMed database. However, it was not applicable to search PCs without anti-DPPH radical activity. Therefore, the strategy of data mining to establish training sets was designed as the positive samples searched in PubMed database and negative samples collected in ChEMBL database.
In addition to that the assumption of homoscedasticity and uncorrelated residuals might be violated in positive subset, the negative subset could not be included in modeling fitting of multiple linear regression due to the lack of concrete IC 50 /EC 50 values. Therefore, stepwise linear discriminant analysis was employed to establish more robust QSAR model. In the point of view of drug discovery and development, the potential of molecular druggability demonstrated by positive bioactivity is more important than predicting their IC 50 /EC 50 values.
According to the molecular descriptors in the final model, each single type of chemical descriptors was indispensable to predict the DPPH scavenging activity of PCs, since at least one descriptor was included for corresponding type. The application of bond dissociation energy (BDE) on the antioxidant of compounds based on Gaussian methods was extensively reported [25][26][27] . However, inconsistent results might be found for the same compound when using different algorithm, which impeded the generalization of BDE in terms of the molecular antioxidant prediction. In addition, we had attempted to establish QSAR models with the physicochemical properties and/or electronic properties of studied compounds as independent variables, but failed in obtaining a robust and reliable model. The success of drug development not only depend on molecular bioactivity. The toxicity and pharmacokinetics of the hits also play important role and should be considered 24 . The drug-likeness and molecular fingerprints could well reflect certain similarities of drug in constitutional and physicochemical properties of related to compounds' absorption, distribution, metabolism, excretion or toxicity (ADMET) properties. Compared to other bioactivity prediction model with single independent variable like BDE and our previous modeling experiments, it could be reasonably projected that establishment of the discriminant functions was based on the RDkit integrated molecular descriptors and possessed more accurate and robust prediction since the valuation of molecular bioactivities have been fully explained by the 5 types of chemical descriptors. Furthermore, the QSAR model established based on the discriminant function could predict the antioxidant activity of compounds relatively effectively, but it could not be judged chemical mechanism of antioxidant of individual compound. However, since the included descriptors contain 5 aspects of indicators, therefore, it could be reasonably assumed that the phenolic compounds with positive antioxidant activity display their bioactivity based on various mechanisms.
The error discriminant proportions found in both back-substitution method (0% and 3.03%, Table 3) and Leave-One-Out cross-validation method (4.04% and 0.95%, Table 4) were all less than 10%, hence the goodness of fit of the obtained QSAR model was satisfactory. According to the results shown in Table 3, the validity of established QSAR model was outstanding with Youden index close to 1 (Sensitivity = 0.9697, Specificity = 1).
Heat map, a vivid visualization method, was employed to qualitatively and objectively reflect the correlation between inclusive molecular descriptors and corresponding chemicals. Figure 2 illustrated the adequateness of the established QSAR model by manifestly different color patterns between the positive and negative samples.
The presented discriminant equations could provide in silico screening for the DPPH scavenging activity of PCs with new structures prior to in vitro and in vivo bioassays, in turn to form more economic and efficient drug development protocol. Inevitably, the study results confined the application to DPPH scavenging activity. To explore molecular antioxidant activities of other types such as superoxide anion radical scavenging activity and ferric ion reducing antioxidant power, it calls for additional investigations based on the same strategy as presented.

Conclusions
The DPPH radical scavenging activity of PCs was related with 5 types of chemical descriptors namely druglikeness and molecular fingerprints, physicochemical, topological, constitutional and electronic property. The reported model could serve as templates of QSAR for various parent nuclear compounds with different bioactivities.

Data availability
All data generated or analysed during this study are included in this published article [and its supplementary information files].